For this homework, datasets are found from UCI Machine Learning Repository and Kaggle.

All of the datasets will be divided into train and test sets. 2/3 of the original data will be the train set and the remaining 1/3 of the original data will be the test set.

The models will be used are as follows.
- Penalized Regression Approaches (PRA)
- Decision Trees (DT)
- Random Forests (RF)
- Stochastic Gradient Boosting (SGB)

Necessary libraries are as follows.

# libraries
suppressMessages(library(readr))
suppressMessages(library(readxl))
suppressMessages(library(glmnet))
suppressMessages(library(Metrics))
suppressMessages(library(rpart))
suppressMessages(library(rattle))
suppressMessages(library(stats))
suppressMessages(library(e1071))
suppressMessages(library(caret))
suppressMessages(library(randomForest))
suppressMessages(library(gbm))

Dataset 1

Fetal Health Classification

  • Description: Reduction of child mortality is reflected in several of the United Nations’ Sustainable Development Goals and is a key indicator of human progress. Cardiotocograms (CTGs) are a simple and cost accessible option to assess fetal health, allowing healthcare professionals to take action in order to prevent child and maternal mortality. The main goal is classifing fetal health in order to prevent child and maternal mortality. Target variable has three classes: Normal, Suspect and Pathological. Also, this dataset contains the observations of Cardiotocograms (CTGs) such as accelerations, decelerations, histogram values.
  • Tasks: Multi-class Classification
  • Number of observations: 2126
  • Number of features: 22
  • Feature characteristics: Integer, real

After reading the dataset, train and test sets are created.

# reading dataset 1
health_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/FetalHealth.csv"))
health_data$fetal_health <- as.factor(health_data$fetal_health)

# creating train and test sets for dataset 1
set.seed(1)
health_index <- sample(1:nrow(health_data), (2/3) * nrow(health_data))
health_train <- health_data[health_index, ]
health_test <- health_data[-health_index, ]
paste("Total:", nrow(health_data), "  Train:", nrow(health_train), 
    "  Test:", nrow(health_test))
## [1] "Total: 2126   Train: 1417   Test: 709"

Penalized Regression Approaches (PRA)

For determining the Lasso penalty, lambda, 10-folds Cross Validation is used.

set.seed(2)
health_cv_fit <- cv.glmnet(as.matrix(health_train[, -22]), health_train$fetal_health, 
    family = "multinomial", nfolds = 10)
health_cv_fit
## 
## Call:  cv.glmnet(x = as.matrix(health_train[, -22]), y = health_train$fetal_health,      nfolds = 10, family = "multinomial") 
## 
## Measure: Multinomial Deviance 
## 
##       Lambda Measure      SE Nonzero
## min 0.001162  0.4645 0.03179      13
## 1se 0.006200  0.4942 0.02579       7
cat(" Lambda min:", health_cv_fit$lambda.min, "\n", "Lambda 1se:", 
    health_cv_fit$lambda.1se)
##  Lambda min: 0.00116184 
##  Lambda 1se: 0.006200392
plot(health_cv_fit)

After the CV results, minimum lambda value 0.001328 is selected for using in the Penalized Regression Approach.

health_pra <- glmnet(as.matrix(health_train[, -22]), health_train$fetal_health, 
    family = "multinomial", lambda = health_cv_fit$lambda.min)
health_pra_pred <- data.frame(predict(health_pra, as.matrix(health_test[, 
    -22]), type = "class"))
health_pra_pred$s0 <- as.factor(health_pra_pred$s0)
confusionMatrix(health_pra_pred$s0, health_test$fetal_health)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 508  29   5
##          2  34  71  10
##          3   5   4  43
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8773          
##                  95% CI : (0.8509, 0.9005)
##     No Information Rate : 0.7715          
##     P-Value [Acc > NIR] : 5.152e-13       
##                                           
##                   Kappa : 0.6774          
##                                           
##  Mcnemar's Test P-Value : 0.3965          
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9287   0.6827  0.74138
## Specificity            0.7901   0.9273  0.98618
## Pos Pred Value         0.9373   0.6174  0.82692
## Neg Pred Value         0.7665   0.9444  0.97717
## Prevalence             0.7715   0.1467  0.08181
## Detection Rate         0.7165   0.1001  0.06065
## Detection Prevalence   0.7645   0.1622  0.07334
## Balanced Accuracy      0.8594   0.8050  0.86378

Fetal health values are predicted with Penalized Regression Model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9097 So, 91% of the test set is predicted correctly.

Decision Trees (DT)

In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 10, 15, 20, 25 and 30 are used, and for complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are used.

set.seed(3)
health_dt_minbucket <- tune.rpart(fetal_health ~ ., data = health_train, 
    minbucket = seq(10, 35, 5))
plot(health_dt_minbucket, main = "Performance of rpart vs. minbucket")

health_dt_minbucket$best.parameters$minbucket
## [1] 10
health_dt_cp <- tune.rpart(fetal_health ~ ., data = health_train, 
    cp = seq(0.005, 0.03, 0.005))
plot(health_dt_cp, main = "Performance of rpart vs. cp")

health_dt_cp$best.parameters$cp
## [1] 0.015

As the best parameter value, the minimal number of observations per tree leaf takes the value of 10 and complexity parameter takes the value of 0.015.

health_dt <- rpart(fetal_health ~ ., data = health_train, method = "class", 
    control = rpart.control(minbucket = health_dt_minbucket$best.parameters$minbucket, 
        cp = health_dt_cp$best.parameters$cp))
fancyRpartPlot(health_dt)

health_dt_pred <- data.frame(predict(health_dt, health_test[, 
    -22], type = "class"))
colnames(health_dt_pred) <- "s0"
health_dt_pred$s0 <- as.factor(health_dt_pred$s0)
confusionMatrix(health_dt_pred$s0, health_test$fetal_health)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 531  41   8
##          2   7  62   1
##          3   9   1  49
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9055         
##                  95% CI : (0.8815, 0.926)
##     No Information Rate : 0.7715         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7281         
##                                          
##  Mcnemar's Test P-Value : 2.333e-05      
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9707  0.59615  0.84483
## Specificity            0.6975  0.98678  0.98464
## Pos Pred Value         0.9155  0.88571  0.83051
## Neg Pred Value         0.8760  0.93427  0.98615
## Prevalence             0.7715  0.14669  0.08181
## Detection Rate         0.7489  0.08745  0.06911
## Detection Prevalence   0.8181  0.09873  0.08322
## Balanced Accuracy      0.8341  0.79147  0.91473

Fetal health values are predicted with Decision Tree model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9055 So, 90% of the test set is predicted correctly.

Random Forests (RF)

set.seed(4)
health_rf <- randomForest(data.matrix(health_train[, -22]), health_train$fetal_health, 
    ntree = 500, nodesize = 5)
health_rf$mtry
## [1] 4

With the default parameters of Random Forest, bootstrap samples uses a random sample of 4 features while splitting each node. For that parameter, 2, 4, 6, 8, 10 and 12 are used while tuning.

fitControl <- trainControl(method = "repeatedcv", number = 3, 
    repeats = 2, search = "grid")
tunegrid <- expand.grid(.mtry = seq(2, 12, 2))
health_rf <- train(fetal_health ~ ., data = health_train, method = "rf", 
    metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(health_rf)
## Random Forest 
## 
## 1417 samples
##   21 predictor
##    3 classes: '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times) 
## Summary of sample sizes: 946, 944, 944, 945, 944, 945, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9315478  0.8016754
##    4    0.9421344  0.8353148
##    6    0.9438954  0.8401843
##    8    0.9438962  0.8411073
##   10    0.9435431  0.8402300
##   12    0.9456647  0.8458604
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 12.
plot(health_rf)

According to the accuracy values, mtry is selected as 6. It has accuracy value of 0.9361331

health_rf <- randomForest(health_train[, -22], health_train$fetal_health, 
    ntree = 500, nodesize = 5, mtry = 6)
health_rf
## 
## Call:
##  randomForest(x = health_train[, -22], y = health_train$fetal_health,      ntree = 500, mtry = 6, nodesize = 5) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 5.15%
## Confusion matrix:
##      1   2   3 class.error
## 1 1087  18   3  0.01895307
## 2   38 151   2  0.20942408
## 3    5   7 106  0.10169492

Error rate of OOB estimate is 6.49%. Also, the class error of 1(normal) is 2%, the class error of class 2(suspect) is 27% and the class error of 3(pathological) is 12%. This difference can be the result of class imbalance in the dataset. Most of the data contains the class of 1.

varImpPlot(health_rf)

According to the variable importance plot, mean value of short term variability, abnormal short term variability and percentage of time with abnormal long term variability are the most important feature and they have the maximum effect on Gini index.

health_rf_pred <- predict(health_rf, health_test[, -22], type = "class")
confusionMatrix(health_rf_pred, health_test$fetal_health)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 532  31   4
##          2  10  72   2
##          3   5   1  52
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9252          
##                  95% CI : (0.9034, 0.9435)
##     No Information Rate : 0.7715          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.7917          
##                                           
##  Mcnemar's Test P-Value : 0.01069         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9726   0.6923  0.89655
## Specificity            0.7840   0.9802  0.99078
## Pos Pred Value         0.9383   0.8571  0.89655
## Neg Pred Value         0.8944   0.9488  0.99078
## Prevalence             0.7715   0.1467  0.08181
## Detection Rate         0.7504   0.1016  0.07334
## Detection Prevalence   0.7997   0.1185  0.08181
## Balanced Accuracy      0.8783   0.8362  0.94367

Fetal health values are predicted with Random Forest. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9549 So, 95% of the test set is predicted correctly.

Stochastic Gradient Boosting (SGB)

In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.

set.seed(5)
fitControl <- trainControl(method = "repeatedcv", number = 5, 
    repeats = 3, verboseIter = FALSE, summaryFunction = multiClassSummary, 
    allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001, 
    0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- capture.output(health_gbm <- train(fetal_health ~ 
    ., data = health_train, method = "gbm", trControl = fitControl, 
    tuneGrid = tunegrid))
print(health_gbm)
## Stochastic Gradient Boosting 
## 
## 1417 samples
##   21 predictor
##    3 classes: '1', '2', '3' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 1134, 1133, 1135, 1133, 1133, 1134, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  Accuracy   Kappa      Mean_F1  
##   0.001      1                   50      0.8788676  0.6327877  0.7169187
##   0.001      1                  100      0.8788701  0.6310139  0.7162940
##   0.001      1                  150      0.8791048  0.6315316  0.7172609
##   0.001      2                   50      0.9047458  0.7198222  0.8136787
##   0.001      2                  100      0.9045152  0.7185403  0.8120870
##   0.001      2                  150      0.9054541  0.7205947  0.8153378
##   0.001      3                   50      0.9134470  0.7456027  0.8366522
##   0.001      3                  100      0.9146282  0.7495203  0.8391715
##   0.001      3                  150      0.9141596  0.7471088  0.8366850
##   0.005      1                   50      0.8800413  0.6349879  0.7175838
##   0.005      1                  100      0.8802836  0.6329188  0.7187789
##   0.005      1                  150      0.8833427  0.6404088  0.7256026
##   0.005      2                   50      0.9071073  0.7241840  0.8183971
##   0.005      2                  100      0.9101598  0.7318914  0.8236594
##   0.005      2                  150      0.9132156  0.7409628  0.8300266
##   0.005      3                   50      0.9155755  0.7507752  0.8398699
##   0.005      3                  100      0.9191016  0.7613765  0.8468611
##   0.005      3                  150      0.9247396  0.7787165  0.8570079
##   0.010      1                   50      0.8814564  0.6368102  0.7223253
##   0.010      1                  100      0.8878127  0.6539154  0.7393570
##   0.010      1                  150      0.9035687  0.7037120  0.7979462
##   0.010      2                   50      0.9122783  0.7387567  0.8284104
##   0.010      2                  100      0.9198008  0.7629265  0.8458993
##   0.010      2                  150      0.9256686  0.7824400  0.8605128
##   0.010      3                   50      0.9183932  0.7588268  0.8444687
##   0.010      3                  100      0.9263819  0.7842595  0.8614184
##   0.010      3                  150      0.9320182  0.8029033  0.8738376
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.6741869         0.8748725         0.8240197            0.9226691          
##   0.6729861         0.8735491         0.8231083            0.9239785          
##   0.6734671         0.8733659         0.8239362            0.9244624          
##   0.7817941         0.8965016         0.8576302            0.9307704          
##   0.7794450         0.8961497         0.8570856            0.9316158          
##   0.7819187         0.8954298         0.8618817            0.9323942          
##   0.8100817         0.9020178         0.8756090            0.9376116          
##   0.8121250         0.9035377         0.8772682            0.9383712          
##   0.8086914         0.9020330         0.8771252            0.9391352          
##   0.6744508         0.8750543         0.8234946            0.9257624          
##   0.6726294         0.8716323         0.8283817            0.9269136          
##   0.6764100         0.8722870         0.8350154            0.9312607          
##   0.7829783         0.8947120         0.8687553            0.9351144          
##   0.7856798         0.8950572         0.8776225            0.9393335          
##   0.7913687         0.8978328         0.8846436            0.9424546          
##   0.8112061         0.9020474         0.8824858            0.9409975          
##   0.8183328         0.9055392         0.8894916            0.9442760          
##   0.8283022         0.9122522         0.8977370            0.9494469          
##   0.6758310         0.8731517         0.8317869            0.9280922          
##   0.6881381         0.8748241         0.8434804            0.9353116          
##   0.7456249         0.8835748         0.8798007            0.9433789          
##   0.7916976         0.8980346         0.8807255            0.9413711          
##   0.8116838         0.9067995         0.8927703            0.9465946          
##   0.8306779         0.9151898         0.9000709            0.9487598          
##   0.8148972         0.9044639         0.8877499            0.9443435          
##   0.8343105         0.9147517         0.8994646            0.9500495          
##   0.8511530         0.9230663         0.9056160            0.9523427          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.8240197       0.6741869    0.2929559            0.7745297             
##   0.8231083       0.6729861    0.2929567            0.7732676             
##   0.8239362       0.6734671    0.2930349            0.7734165             
##   0.8576302       0.7817941    0.3015819            0.8391479             
##   0.8570856       0.7794450    0.3015051            0.8377974             
##   0.8618817       0.7819187    0.3018180            0.8386743             
##   0.8756090       0.8100817    0.3044823            0.8560498             
##   0.8772682       0.8121250    0.3048761            0.8578314             
##   0.8771252       0.8086914    0.3047199            0.8553622             
##   0.8234946       0.6744508    0.2933471            0.7747526             
##   0.8283817       0.6726294    0.2934279            0.7721308             
##   0.8350154       0.6764100    0.2944476            0.7743485             
##   0.8687553       0.7829783    0.3023691            0.8388451             
##   0.8776225       0.7856798    0.3033866            0.8403685             
##   0.8846436       0.7913687    0.3044052            0.8446007             
##   0.8824858       0.8112061    0.3051918            0.8566267             
##   0.8894916       0.8183328    0.3063672            0.8619360             
##   0.8977370       0.8283022    0.3082465            0.8702772             
##   0.8317869       0.6758310    0.2938188            0.7744914             
##   0.8434804       0.6881381    0.2959376            0.7814811             
##   0.8798007       0.7456249    0.3011896            0.8145998             
##   0.8807255       0.7916976    0.3040928            0.8448661             
##   0.8927703       0.8116838    0.3066003            0.8592416             
##   0.9000709       0.8306779    0.3085562            0.8729338             
##   0.8877499       0.8148972    0.3061311            0.8596805             
##   0.8994646       0.8343105    0.3087940            0.8745311             
##   0.9056160       0.8511530    0.3106727            0.8871097             
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(health_gbm)

According to the results, final values used for the model is n.trees=150, interaction.depth=3, shrinkage=0.01 and n.minobsinnode = 10.

health_gbm_final = suppressWarnings(gbm(fetal_health ~ ., data = health_train, 
    distribution = "multinomial", n.trees = 150, interaction.depth = 3, 
    n.minobsinnode = 10, shrinkage = 0.01))
summary(health_gbm_final)

##                                                                                                           var
## abnormal_short_term_variability                                               abnormal_short_term_variability
## percentage_of_time_with_abnormal_long_term_variability percentage_of_time_with_abnormal_long_term_variability
## histogram_mean                                                                                 histogram_mean
## mean_value_of_short_term_variability                                     mean_value_of_short_term_variability
## prolongued_decelerations                                                             prolongued_decelerations
## accelerations                                                                                   accelerations
## histogram_mode                                                                                 histogram_mode
## `baseline value`                                                                             `baseline value`
## histogram_min                                                                                   histogram_min
## uterine_contractions                                                                     uterine_contractions
## histogram_median                                                                             histogram_median
## histogram_max                                                                                   histogram_max
## histogram_number_of_peaks                                                           histogram_number_of_peaks
## mean_value_of_long_term_variability                                       mean_value_of_long_term_variability
## histogram_width                                                                               histogram_width
## histogram_variance                                                                         histogram_variance
## fetal_movement                                                                                 fetal_movement
## light_decelerations                                                                       light_decelerations
## severe_decelerations                                                                     severe_decelerations
## histogram_number_of_zeroes                                                         histogram_number_of_zeroes
## histogram_tendency                                                                         histogram_tendency
##                                                             rel.inf
## abnormal_short_term_variability                        21.950351068
## percentage_of_time_with_abnormal_long_term_variability 20.419051435
## histogram_mean                                         16.219250601
## mean_value_of_short_term_variability                   15.975734410
## prolongued_decelerations                                6.252649044
## accelerations                                           5.143927031
## histogram_mode                                          2.929799121
## `baseline value`                                        2.637718803
## histogram_min                                           1.979950300
## uterine_contractions                                    1.516021868
## histogram_median                                        1.472416703
## histogram_max                                           1.044207516
## histogram_number_of_peaks                               0.684141734
## mean_value_of_long_term_variability                     0.668165018
## histogram_width                                         0.621640302
## histogram_variance                                      0.418581419
## fetal_movement                                          0.061195631
## light_decelerations                                     0.005197997
## severe_decelerations                                    0.000000000
## histogram_number_of_zeroes                              0.000000000
## histogram_tendency                                      0.000000000

Histogram mean is the most important feature with the value of 22.01981133.

health_gbm_pred <- predict(health_gbm, health_test[, -22], type = "raw")
confusionMatrix(health_gbm_pred, health_test$fetal_health)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3
##          1 533  34   8
##          2   9  69   1
##          3   5   1  49
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9182          
##                  95% CI : (0.8955, 0.9373)
##     No Information Rate : 0.7715          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7673          
##                                           
##  Mcnemar's Test P-Value : 0.001632        
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            0.9744  0.66346  0.84483
## Specificity            0.7407  0.98347  0.99078
## Pos Pred Value         0.9270  0.87342  0.89091
## Neg Pred Value         0.8955  0.94444  0.98624
## Prevalence             0.7715  0.14669  0.08181
## Detection Rate         0.7518  0.09732  0.06911
## Detection Prevalence   0.8110  0.11142  0.07757
## Balanced Accuracy      0.8576  0.82347  0.91781

Fetal health values of the articles are predicted with Stochastic Gradient Boosting. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9309. So, 93% of the test set is predicted correctly.

Conclusion

To be conclude, accuracy values of the prediction of test set are:
- Penalized Regression Approaches (PRA): 0.9097
- Decision Trees (DT): 0.9055
- Random Forests (RF): 0.9549
- Stochastic Gradient Boosting (SGB): 0.9309.

So, the random forest can be selected for this dataset as the best predictive model compared to others.

Dataset 2

Credit Card Customers

Description: Leaving credit card services is a crucial problem for banks.The managers in banks want to predict who is going to churn and after the prediction, they can get precautions to prevent this. Target variable is the attrition of customer (attrited or existing). Also, this dataset contains features such as customer’s age, salary, marital status, credit card limit, credit card category.
Tasks: Classification (This dataset has class imbalance with a ratio of 5:1)
Number of observations: 10127
Number of features: 23
Feature characteristics: Integer, real, categorical, ordinal

After reading the dataset, non-predictive features are excluded and data types are updated. Finally, train and test sets are created.

# reading dataset 2
churn_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/BankChurners.csv"))
## Warning: Missing column names filled in: 'X1' [1]
# substracting non-predictive features
churn_data <- churn_data[, -c(1, 2, 23, 24)]

# defining data types
for (i in 1:nrow(churn_data)) {
    if (churn_data[i, 1] == "Attrited Customer") {
        churn_data[i, 1] <- 1
    }
    if (churn_data[i, 1] == "Existing Customer") {
        churn_data[i, 1] <- 0
    }
}
churn_data$Attrition_Flag <- as.factor(churn_data$Attrition_Flag)
churn_data$Gender <- as.factor(churn_data$Gender)
churn_data$Marital_Status <- as.factor(churn_data$Marital_Status)
churn_data$Education_Level <- as.factor(churn_data$Education_Level)
churn_data$Income_Category <- as.factor(churn_data$Income_Category)
churn_data$Card_Category <- as.factor(churn_data$Card_Category)

# creating train and test sets for dataset 2
set.seed(6)
churn_index <- sample(1:nrow(churn_data), (2/3) * nrow(churn_data))
churn_train <- churn_data[churn_index, ]
churn_test <- churn_data[-churn_index, ]
paste("Total:", nrow(churn_data), "  Train:", nrow(churn_train), 
    "  Test:", nrow(churn_test))
## [1] "Total: 10127   Train: 6751   Test: 3376"

Penalized Regression Approaches (PRA)

For determining the Lasso penalty, lambda, 10-folds Cross Validation is used.

set.seed(7)
churn_cv_fit <- cv.glmnet(data.matrix(churn_train[, -1]), as.matrix(churn_train[, 
    1]), family = "binomial", nfolds = 10)
churn_cv_fit
## 
## Call:  cv.glmnet(x = data.matrix(churn_train[, -1]), y = as.matrix(churn_train[,      1]), nfolds = 10, family = "binomial") 
## 
## Measure: Binomial Deviance 
## 
##       Lambda Measure       SE Nonzero
## min 0.000566  0.4745 0.008114      17
## 1se 0.004385  0.4815 0.007770      11
cat(" Lambda min:", churn_cv_fit$lambda.min, "\n", "Lambda 1se:", 
    churn_cv_fit$lambda.1se)
##  Lambda min: 0.0005662878 
##  Lambda 1se: 0.004384561
plot(churn_cv_fit)

After the CV results, minimum lambda value 0.000477 is selected for using in the Penalized Regression Approach.

churn_pra <- glmnet(data.matrix(churn_train[, -1]), as.matrix(churn_train[, 
    1]), family = "binomial", lambda = churn_cv_fit$lambda.min)
churn_pra_pred <- data.frame(predict(churn_pra, data.matrix(churn_test[, 
    -1]), type = "class"))
confusionMatrix(churn_pra_pred[, 1], churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2735  240
##          1   93  308
##                                           
##                Accuracy : 0.9014          
##                  95% CI : (0.8908, 0.9112)
##     No Information Rate : 0.8377          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5933          
##                                           
##  Mcnemar's Test P-Value : 1.237e-15       
##                                           
##             Sensitivity : 0.9671          
##             Specificity : 0.5620          
##          Pos Pred Value : 0.9193          
##          Neg Pred Value : 0.7681          
##              Prevalence : 0.8377          
##          Detection Rate : 0.8101          
##    Detection Prevalence : 0.8812          
##       Balanced Accuracy : 0.7646          
##                                           
##        'Positive' Class : 0               
## 

Attrition values are predicted with Penalized Regression Model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9043 So, 90% of the test set is predicted correctly.

Decision Trees (DT)

In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 10, 15, 20, 25 and 30 are used, and for complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are used.

set.seed(8)
churn_dt_minbucket <- tune.rpart(Attrition_Flag ~ ., data = churn_train, 
    minbucket = seq(5, 30, 5))
plot(churn_dt_minbucket, main = "Performance of rpart vs. minbucket")

churn_dt_minbucket$best.parameters$minbucket
## [1] 5
churn_dt_cp <- tune.rpart(Attrition_Flag ~ ., data = churn_train, 
    cp = seq(0.005, 0.03, 0.005))
plot(churn_dt_cp, main = "Performance of rpart vs. cp")

churn_dt_cp$best.parameters$cp
## [1] 0.005

As the best parameter value, the minimal number of observations per tree leaf takes the value of 5 and complexity parameter takes the value of 0.005.

churn_dt <- rpart(Attrition_Flag ~ ., data = churn_train, method = "class", 
    control = rpart.control(minbucket = churn_dt_minbucket$best.parameters$minbucket, 
        cp = churn_dt_cp$best.parameters$cp))
fancyRpartPlot(churn_dt)

churn_dt_testpred <- predict(churn_dt, churn_test[, -1], type = "class")
confusionMatrix(churn_dt_testpred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2715   96
##          1  113  452
##                                          
##                Accuracy : 0.9381         
##                  95% CI : (0.9294, 0.946)
##     No Information Rate : 0.8377         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.7752         
##                                          
##  Mcnemar's Test P-Value : 0.2684         
##                                          
##             Sensitivity : 0.9600         
##             Specificity : 0.8248         
##          Pos Pred Value : 0.9658         
##          Neg Pred Value : 0.8000         
##              Prevalence : 0.8377         
##          Detection Rate : 0.8042         
##    Detection Prevalence : 0.8326         
##       Balanced Accuracy : 0.8924         
##                                          
##        'Positive' Class : 0              
## 

Attrition values are predicted with Decision Tree model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9396 So, 94% of the test set is predicted correctly.

Random Forests (RF)

set.seed(9)
churn_rf <- randomForest(churn_train[, -1], churn_train$Attrition_Flag, 
    ntree = 500, nodesize = 5)
churn_rf$mtry
## [1] 4

With the default parameters of Random Forest, bootstrap samples uses a random sample of 4 features while splitting each node. For that parameter, 2, 4, 6, 8, 10 and 12 are used while tuning.

fitControl <- trainControl(method = "repeatedcv", number = 3, 
    repeats = 2, search = "grid")
tunegrid <- expand.grid(.mtry = seq(2, 12, 2))
churn_rf <- train(Attrition_Flag ~ ., data = churn_train, method = "rf", 
    metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(churn_rf)
## Random Forest 
## 
## 6751 samples
##   19 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times) 
## Summary of sample sizes: 4500, 4501, 4501, 4500, 4501, 4501, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9155686  0.6128091
##    4    0.9448227  0.7738760
##    6    0.9549695  0.8203862
##    8    0.9591173  0.8389089
##   10    0.9600802  0.8441147
##   12    0.9592655  0.8415980
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 10.
plot(churn_rf)

According to the accuracy values, mtry is selected as 12. It has accuracy value of 0.9609692 It can be seen in the plot, accuracy is increases with the increase in mtry.

churn_rf <- randomForest(churn_train[, -1], churn_train$Attrition_Flag, 
    ntree = 500, nodesize = 5, mtry = 12)
churn_rf
## 
## Call:
##  randomForest(x = churn_train[, -1], y = churn_train$Attrition_Flag,      ntree = 500, mtry = 12, nodesize = 5) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 3.75%
## Confusion matrix:
##      0   1 class.error
## 0 5586  86   0.0151622
## 1  167 912   0.1547729

Error rate of OOB estimate is 3.73%. Also, the class error of 0 is 0.16% and the class error of 1 is 15%. This difference can be the result of class imbalance in the dataset.

varImpPlot(churn_rf)

According to the variable importance plot, Total Transaction value is the most important feature and it has the maximum effect on Gini index.

churn_rf_pred <- predict(churn_rf, churn_test[, -1], type = "class")
confusionMatrix(churn_rf_pred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2779   66
##          1   49  482
##                                           
##                Accuracy : 0.9659          
##                  95% CI : (0.9593, 0.9718)
##     No Information Rate : 0.8377          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8732          
##                                           
##  Mcnemar's Test P-Value : 0.1357          
##                                           
##             Sensitivity : 0.9827          
##             Specificity : 0.8796          
##          Pos Pred Value : 0.9768          
##          Neg Pred Value : 0.9077          
##              Prevalence : 0.8377          
##          Detection Rate : 0.8232          
##    Detection Prevalence : 0.8427          
##       Balanced Accuracy : 0.9311          
##                                           
##        'Positive' Class : 0               
## 

Attrition values are predicted with Random Forest. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9648 So, 96% of the test set is predicted correctly. The main error is results from predicting the existing customers as attrited customer due to class imbalance.

Stochastic Gradient Boosting (SGB)

In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.

set.seed(10)
fitControl <- trainControl(method = "repeatedcv", number = 5, 
    repeats = 3, verboseIter = FALSE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001, 
    0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- capture.output(churn_gbm <- train(Attrition_Flag ~ 
    ., data = churn_train, method = "gbm", trControl = fitControl, 
    tuneGrid = tunegrid))
print(churn_gbm)
## Stochastic Gradient Boosting 
## 
## 6751 samples
##   19 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 5402, 5401, 5400, 5400, 5401, 5401, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  Accuracy   Kappa     
##   0.001      1                   50      0.8401719  0.00000000
##   0.001      1                  100      0.8401719  0.00000000
##   0.001      1                  150      0.8401719  0.00000000
##   0.001      2                   50      0.8401719  0.00000000
##   0.001      2                  100      0.8401719  0.00000000
##   0.001      2                  150      0.8401719  0.00000000
##   0.001      3                   50      0.8401719  0.00000000
##   0.001      3                  100      0.8401719  0.00000000
##   0.001      3                  150      0.8401719  0.00000000
##   0.005      1                   50      0.8401719  0.00000000
##   0.005      1                  100      0.8401719  0.00000000
##   0.005      1                  150      0.8401719  0.00000000
##   0.005      2                   50      0.8401719  0.00000000
##   0.005      2                  100      0.8401719  0.00000000
##   0.005      2                  150      0.8482690  0.08218471
##   0.005      3                   50      0.8401719  0.00000000
##   0.005      3                  100      0.8401719  0.00000000
##   0.005      3                  150      0.8590333  0.18983478
##   0.010      1                   50      0.8401719  0.00000000
##   0.010      1                  100      0.8401719  0.00000000
##   0.010      1                  150      0.8477768  0.08012661
##   0.010      2                   50      0.8401719  0.00000000
##   0.010      2                  100      0.8754253  0.33338847
##   0.010      2                  150      0.8959658  0.50276443
##   0.010      3                   50      0.8401719  0.00000000
##   0.010      3                  100      0.8856949  0.41244052
##   0.010      3                  150      0.9104330  0.59221673
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(churn_gbm)

According to the results, final values used for the model is n.trees=150, interaction.depth=3, shrinkage=0.01 and n.minobsinnode = 10. It can be seen in the plot, selecting shrinkage as 0.01 makes a huge difference in accuracy.

churn_gbm_pred <- predict(churn_gbm, churn_test[, -1], type = "raw")
confusionMatrix(churn_gbm_pred, churn_test$Attrition_Flag)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2796  283
##          1   32  265
##                                           
##                Accuracy : 0.9067          
##                  95% CI : (0.8964, 0.9163)
##     No Information Rate : 0.8377          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5792          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9887          
##             Specificity : 0.4836          
##          Pos Pred Value : 0.9081          
##          Neg Pred Value : 0.8923          
##              Prevalence : 0.8377          
##          Detection Rate : 0.8282          
##    Detection Prevalence : 0.9120          
##       Balanced Accuracy : 0.7361          
##                                           
##        'Positive' Class : 0               
## 

Attrition values are predicted with Stochastic Gradient Boosting. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.9188 So, 92% of the test set is predicted correctly. Again, the main error is results from predicting the existing customers as attrited customer due to class imbalance.

Conclusion

To be conclude, accuracy values of the prediction of test set are:
- Penalized Regression Approaches (PRA): 0.9043
- Decision Trees (DT): 0.9396
- Random Forests (RF): 0.9648
- Stochastic Gradient Boosting (SGB): 0.9188

So, the random forest can be selected for this dataset as the best predictive model compared to others.

Dataset 3

Sports Articles for Objectivity Analysis

Description: 1000 sports articles were labeled using Amazon Mechanical Turk as objective or subjective. The main aim is the predicting the article’s objectivity with investigating the usage of nouns, adjectives, adverbs, symbols, etc. Target variable is the label of article (objective/subjective).
Tasks: Classification
Number of observations: 1000
Number of features: 62
Feature characteristics: Integer

After reading the dataset, non-predictive features are excluded and data types are updated. Finally, train and test sets are created.

# reading dataset 3
article_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/SportsArticles.csv"))
## Warning: Missing column names filled in: 'X1' [1]
# substracting non-predictive features (X1, TextID and URL)
article_data <- article_data[, -c(1, 2, 3)]

# defining data types
article_data$Label <- as.factor(article_data$Label)

# creating train and test sets for dataset 3
set.seed(11)
article_index <- sample(1:nrow(article_data), (2/3) * nrow(article_data))
article_train <- article_data[article_index, ]
article_test <- article_data[-article_index, ]
paste("Total:", nrow(article_data), "  Train:", nrow(article_train), 
    "  Test:", nrow(article_test))
## [1] "Total: 1000   Train: 666   Test: 334"

Penalized Regression Approaches (PRA)

For determining the Lasso penalty, lambda, 10-folds Cross Validation is used.

set.seed(12)
article_cv_fit <- cv.glmnet(data.matrix(article_train[, -1]), 
    as.matrix(article_train[, 1]), family = "binomial", nfolds = 10)
article_cv_fit
## 
## Call:  cv.glmnet(x = data.matrix(article_train[, -1]), y = as.matrix(article_train[,      1]), nfolds = 10, family = "binomial") 
## 
## Measure: Binomial Deviance 
## 
##      Lambda Measure      SE Nonzero
## min 0.00388  0.9166 0.06775      41
## 1se 0.04355  0.9812 0.03767      13
cat(" Lambda min:", article_cv_fit$lambda.min, "\n", "Lambda 1se:", 
    article_cv_fit$lambda.1se)
##  Lambda min: 0.00387694 
##  Lambda 1se: 0.0435506
plot(article_cv_fit)

After the CV results, minimum lambda value 0.00981 is selected for using in the Penalized Regression Approach.

article_pra <- glmnet(data.matrix(article_train[, -1]), as.matrix(article_train[, 
    1]), family = "binomial", lambda = article_cv_fit$lambda.min)
article_pra_pred <- data.frame(predict(article_pra, data.matrix(article_test[, 
    -1]), type = "class"))
article_pra_pred$s0 <- as.factor(article_pra_pred$s0)
confusionMatrix(article_pra_pred$s0, article_test$Label)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   objective subjective
##   objective        192         39
##   subjective        20         83
##                                           
##                Accuracy : 0.8234          
##                  95% CI : (0.7781, 0.8627)
##     No Information Rate : 0.6347          
##     P-Value [Acc > NIR] : 3.137e-14       
##                                           
##                   Kappa : 0.606           
##                                           
##  Mcnemar's Test P-Value : 0.01911         
##                                           
##             Sensitivity : 0.9057          
##             Specificity : 0.6803          
##          Pos Pred Value : 0.8312          
##          Neg Pred Value : 0.8058          
##              Prevalence : 0.6347          
##          Detection Rate : 0.5749          
##    Detection Prevalence : 0.6916          
##       Balanced Accuracy : 0.7930          
##                                           
##        'Positive' Class : objective       
## 

Labels of the articles are predicted with Penalized Regression Model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.8443 So, 84% of the test set is predicted correctly.

Decision Trees (DT)

In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 10, 15, 20, 25 and 30 are used, and for complexity parameter 0.005, 0.01, 0.015, 0.02, 0.025 and 0.03 are used.

set.seed(13)
article_dt_minbucket <- tune.rpart(Label ~ ., data = article_train, 
    minbucket = seq(5, 30, 5))
plot(article_dt_minbucket, main = "Performance of rpart vs. minbucket")

article_dt_minbucket$best.parameters$minbucket
## [1] 5
article_dt_cp <- tune.rpart(Label ~ ., data = article_train, 
    cp = seq(0.005, 0.03, 0.005))
plot(article_dt_cp, main = "Performance of rpart vs. cp")

article_dt_cp$best.parameters$cp
## [1] 0.03

As the best parameter value, the minimal number of observations per tree leaf takes the value of 5 and complexity parameter takes the value of 0.01.

article_dt <- rpart(Label ~ ., data = article_train, method = "class", 
    control = rpart.control(minbucket = article_dt_minbucket$best.parameters$minbucket, 
        cp = article_dt_cp$best.parameters$cp))
fancyRpartPlot(article_dt)

article_dt_pred <- predict(article_dt, article_test[, -1], type = "class")
confusionMatrix(article_dt_pred, article_test$Label)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   objective subjective
##   objective        165         28
##   subjective        47         94
##                                           
##                Accuracy : 0.7754          
##                  95% CI : (0.7269, 0.8191)
##     No Information Rate : 0.6347          
##     P-Value [Acc > NIR] : 2.173e-08       
##                                           
##                   Kappa : 0.5312          
##                                           
##  Mcnemar's Test P-Value : 0.03767         
##                                           
##             Sensitivity : 0.7783          
##             Specificity : 0.7705          
##          Pos Pred Value : 0.8549          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6347          
##          Detection Rate : 0.4940          
##    Detection Prevalence : 0.5778          
##       Balanced Accuracy : 0.7744          
##                                           
##        'Positive' Class : objective       
## 

Attrition values are predicted with Decision Tree model. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.8353 So, 83% of the test set is predicted correctly.

Random Forests (RF)

set.seed(14)
article_rf <- randomForest(article_train[, -1], article_train$Label, 
    ntree = 500, nodesize = 5)
article_rf$mtry
## [1] 7

With the default parameters of Random Forest, bootstrap samples uses a random sample of 7 features while splitting each node. For that parameter, 5, 7, 9, 11, 13 and 15 are used while tuning.

fitControl <- trainControl(method = "repeatedcv", number = 5, 
    repeats = 3, search = "grid")
tunegrid <- expand.grid(.mtry = seq(5, 15, 2))
article_rf <- train(Label ~ ., data = article_train, method = "rf", 
    metric = "Accuracy", trControl = fitControl, tuneGrid = tunegrid)
print(article_rf)
## Random Forest 
## 
## 666 samples
##  59 predictor
##   2 classes: 'objective', 'subjective' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 532, 533, 533, 533, 533, 533, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    5    0.8182595  0.6030295
##    7    0.8187343  0.6054435
##    9    0.8167254  0.6006689
##   11    0.8187456  0.6054645
##   13    0.8202531  0.6090822
##   15    0.8157455  0.5991784
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
plot(article_rf)

According to the accuracy values, mtry is selected as 9. It has accuracy value of 0.8133281

article_rf <- randomForest(article_train[, -1], article_train$Label, 
    ntree = 500, nodesize = 5, mtry = 9)
article_rf
## 
## Call:
##  randomForest(x = article_train[, -1], y = article_train$Label,      ntree = 500, mtry = 9, nodesize = 5) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 18.92%
## Confusion matrix:
##            objective subjective class.error
## objective        359         64   0.1513002
## subjective        62        181   0.2551440

Error rate of OOB estimate is 18.62%. Also, the class error of objective is 0.11% and the class error of 1 is 30%.

varImpPlot(article_rf)

According to the variable importance plot, LS:Frequency of list item markers and PRP$:Frequency of possessive pronouns are the most important features and they have the maximum effect on Gini index.

article_rf_pred <- predict(article_rf, article_test[, -1], type = "class")
confusionMatrix(article_rf_pred, article_test$Label)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   objective subjective
##   objective        176         26
##   subjective        36         96
##                                           
##                Accuracy : 0.8144          
##                  95% CI : (0.7684, 0.8546)
##     No Information Rate : 0.6347          
##     P-Value [Acc > NIR] : 5.622e-13       
##                                           
##                   Kappa : 0.6065          
##                                           
##  Mcnemar's Test P-Value : 0.253           
##                                           
##             Sensitivity : 0.8302          
##             Specificity : 0.7869          
##          Pos Pred Value : 0.8713          
##          Neg Pred Value : 0.7273          
##              Prevalence : 0.6347          
##          Detection Rate : 0.5269          
##    Detection Prevalence : 0.6048          
##       Balanced Accuracy : 0.8085          
##                                           
##        'Positive' Class : objective       
## 

Labels of the articles are predicted with Random Forest. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.8473 So, 85% of the test set is predicted correctly.

Stochastic Gradient Boosting (SGB)

In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.

set.seed(15)
fitControl <- trainControl(method = "repeatedcv", number = 5, 
    repeats = 3, verboseIter = FALSE, summaryFunction = twoClassSummary, 
    classProbs = TRUE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001, 
    0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- suppressWarnings(capture.output(article_gbm <- train(Label ~ 
    ., data = article_train, method = "gbm", trControl = fitControl, 
    tuneGrid = tunegrid)))
print(article_gbm)
## Stochastic Gradient Boosting 
## 
## 666 samples
##  59 predictor
##   2 classes: 'objective', 'subjective' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 532, 533, 533, 533, 533, 532, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  ROC        Sens       Spec     
##   0.001      1                   50      0.8472028  1.0000000  0.0000000
##   0.001      1                  100      0.8487066  1.0000000  0.0000000
##   0.001      1                  150      0.8493557  1.0000000  0.0000000
##   0.001      2                   50      0.8525256  1.0000000  0.0000000
##   0.001      2                  100      0.8537802  1.0000000  0.0000000
##   0.001      2                  150      0.8543623  1.0000000  0.0000000
##   0.001      3                   50      0.8563527  1.0000000  0.0000000
##   0.001      3                  100      0.8573803  1.0000000  0.0000000
##   0.001      3                  150      0.8573920  1.0000000  0.0000000
##   0.005      1                   50      0.8469414  1.0000000  0.0000000
##   0.005      1                  100      0.8523878  0.9338936  0.4827381
##   0.005      1                  150      0.8509636  0.8960691  0.5924887
##   0.005      2                   50      0.8546853  1.0000000  0.0000000
##   0.005      2                  100      0.8567238  0.9354715  0.4896825
##   0.005      2                  150      0.8584464  0.8960504  0.6146259
##   0.005      3                   50      0.8586486  1.0000000  0.0000000
##   0.005      3                  100      0.8596539  0.9448833  0.4870465
##   0.005      3                  150      0.8610910  0.9070588  0.6062642
##   0.010      1                   50      0.8478035  0.9354342  0.4744331
##   0.010      1                  100      0.8513792  0.8802988  0.6405896
##   0.010      1                  150      0.8521180  0.8676751  0.6640306
##   0.010      2                   50      0.8580020  0.9417460  0.4925454
##   0.010      2                  100      0.8601241  0.8881793  0.6599206
##   0.010      2                  150      0.8629500  0.8787208  0.6996315
##   0.010      3                   50      0.8597358  0.9472456  0.4938776
##   0.010      3                  100      0.8627375  0.8921289  0.6612812
##   0.010      3                  150      0.8653259  0.8795145  0.7010488
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(article_gbm)

According to the results, final values used for the model is n.trees=150, interaction.depth=3, shrinkage=0.01 and n.minobsinnode = 10.

article_gbm_pred <- predict(article_gbm, article_test[, -1], 
    type = "raw")
confusionMatrix(article_gbm_pred, article_test$Label)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   objective subjective
##   objective        183         29
##   subjective        29         93
##                                           
##                Accuracy : 0.8263          
##                  95% CI : (0.7814, 0.8654)
##     No Information Rate : 0.6347          
##     P-Value [Acc > NIR] : 1.152e-14       
##                                           
##                   Kappa : 0.6255          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8632          
##             Specificity : 0.7623          
##          Pos Pred Value : 0.8632          
##          Neg Pred Value : 0.7623          
##              Prevalence : 0.6347          
##          Detection Rate : 0.5479          
##    Detection Prevalence : 0.6347          
##       Balanced Accuracy : 0.8128          
##                                           
##        'Positive' Class : objective       
## 

Labels of the articles are predicted with Stochastic Gradient Boosting. Then, confusion matrix is used for comparing the actual and predicted values in the test set. According to the confusion matrix, accuracy of the prediction is 0.8413 So, 84% of the test set is predicted correctly. Again, the main error is results from predicting the existing customers as attrited customer due to class imbalance.

Conclusion

To be conclude, accuracy values of the prediction of test set are:
- Penalized Regression Approaches (PRA): 0.8443
- Decision Trees (DT): 0.8353
- Random Forests (RF): 0.8473
- Stochastic Gradient Boosting (SGB): 0.8413

So, the random forest can be selected for this dataset as the best predictive model compared to others. However, their all has nearly the same accuracy value. So, additional improvements that can increase accuracy may be beneficial.

Dataset 4

Superconductivity

Description: This dataset contains data about superconductors and their relevant features. The aim is predicting the critical temperature for superconductor. Tasks: Regression Number of observations: 21263
Number of features: 81 Feature characteristics: Real

After reading the dataset, train and test sets are created. Originally, this dataset was having 21263 onservations. However, this analysis is made with random sample of 5000 due to the problems in running in R.

# reading dataset 4
cond_data <- suppressMessages(read_csv("/Users/iremarica/Desktop/Homework4/superconductor.csv"))
set.seed(16)
cond_index <- sample(1:nrow(cond_data), 5000)
cond_data <- cond_data[cond_index, ]

# creating train and test sets for dataset 4
set.seed(17)
cond_index <- sample(1:nrow(cond_data), (2/3) * nrow(cond_data))
cond_train <- cond_data[cond_index, ]
cond_test <- cond_data[-cond_index, ]
paste("Total:", nrow(cond_data), "  Train:", nrow(cond_train), 
    "  Test:", nrow(cond_test))
## [1] "Total: 5000   Train: 3333   Test: 1667"

Penalized Regression Approaches (PRA)

set.seed(18)
cond_cv_fit <- cv.glmnet(data.matrix(cond_train[, -82]), data.matrix(cond_train[, 
    82]), family = "gaussian", nfolds = 10)
cond_cv_fit
## 
## Call:  cv.glmnet(x = data.matrix(cond_train[, -82]), y = data.matrix(cond_train[,      82]), nfolds = 10, family = "gaussian") 
## 
## Measure: Mean-Squared Error 
## 
##      Lambda Measure    SE Nonzero
## min 0.00240   316.2 11.12      78
## 1se 0.03567   326.5 10.88      62
cat(" Lambda min:", cond_cv_fit$lambda.min, "\n", "Lambda 1se:", 
    cond_cv_fit$lambda.1se)
##  Lambda min: 0.002402019 
##  Lambda 1se: 0.03566921
plot(cond_cv_fit)

After the CV results, minimum lambda value 0.002459 is selected for using in the Penalized Regression Approach.

cond_pra <- glmnet(data.matrix(cond_train[, -82]), cond_train$critical_temp, 
    family = "gaussian", lambda = cond_cv_fit$lambda.min)
cond_pra_pred <- data.frame(predict(cond_pra, data.matrix(cond_test[, 
    -82])))
colnames(cond_pra_pred) <- "s0"
rmse(cond_test$critical_temp, cond_pra_pred$s0)
## [1] 17.68855

Critical temperature values are predicted with Penalized Ragression Approach. Then, RMSE (Root Mean Squared Error) values are used for comparing the models. RMSE value of Penalized Ragression Approach is 17.47174.

Decision Trees (DT)

In the Decision Trees, the minimal number of observations per tree leaf and complexity parameter are tuned with cross validation.
For the minimal number of observations per tree leaf, 5, 6 and 7 are used, and for complexity parameter 0.005, 0.01 and 0.015 are used.

set.seed(19)
cond_dt_minbucket <- tune.rpart(critical_temp ~ ., data = cond_train, 
    minbucket = c(5, 6, 7))
plot(cond_dt_minbucket, main = "Performance of rpart vs. minbucket")

cond_dt_minbucket$best.parameters$minbucket
## [1] 5
cond_dt_cp <- tune.rpart(critical_temp ~ ., data = cond_train, 
    cp = c(0.005, 0.01, 0.015))
plot(cond_dt_cp, main = "Performance of rpart vs. cp")

cond_dt_cp$best.parameters$cp
## [1] 0.005

As the best parameter value, the minimal number of observations per tree leaf takes the value of 5 and complexity parameter takes the value of 0.005.

cond_dt <- rpart(critical_temp ~ ., data = cond_train, method = "anova", 
    control = rpart.control(minbucket = cond_dt_minbucket$best.parameters$minbucket, 
        cp = cond_dt_cp$best.parameters$cp))
fancyRpartPlot(cond_dt)

cond_dt_pred <- data.frame(predict(cond_dt, cond_test[, -82]))
colnames(cond_dt_pred) <- "s0"
rmse(cond_test$critical_temp, cond_dt_pred$s0)
## [1] 17.41911

Critical temperature values are predicted with Decision Tree model. Then, RMSE values are used for comparing the models. RMSE value of Decision Tree is 17.28844.

Random Forests (RF)

set.seed(20)
cond_rf <- randomForest(data.matrix(cond_train[, -82]), cond_train$critical_temp, 
    ntree = 500, nodesize = 5)
cond_rf$mtry
## [1] 27

With the default parameters of Random Forest, bootstrap samples uses a random sample of 27 features while splitting each node. For that parameter, 25, 27 and 29 are used while tuning.

fitControl <- trainControl(method = "repeatedcv", number = 3, 
    repeats = 2, search = "grid")
tunegrid <- expand.grid(.mtry = seq(25, 27, 29))
cond_rf <- train(critical_temp ~ ., data = cond_train, method = "rf", 
    trControl = fitControl, tuneGrid = tunegrid)
print(cond_rf)
## Random Forest 
## 
## 3333 samples
##   81 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold, repeated 2 times) 
## Summary of sample sizes: 2223, 2221, 2222, 2223, 2222, 2221, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   12.07697  0.8711659  7.61113
## 
## Tuning parameter 'mtry' was held constant at a value of 25

According to the RMSE values, mtry is selected as 25.

cond_rf <- randomForest(cond_train[, -82], cond_train$critical_temp, 
    ntree = 500, nodesize = 5, mtry = 25)
varImpPlot(cond_rf)

According to the variable importance plot, range of thermal conductivity is the most important feature and it has the maximum effect on node purity.

cond_rf_pred <- data.frame(predict(cond_rf, cond_test[, -82]))
colnames(cond_rf_pred) <- "s0"
rmse(cond_test$critical_temp, cond_rf_pred$s0)
## [1] 11.7012

Critical temperature values are predicted with Random Forest. Then, RMSE values are used for comparing the models. RMSE value of Random Forest is 11.70342

Stochastic Gradient Boosting (SGB)

In the Stochastic Gradient Boosting, depth of the tree, learning rate and number of trees are tuned with cross validation.
For depth of the tree, 1, 2 and 3 are used, for learning rate 0.001, 0.005 and 0.01 are used and for number of trees 50, 100 and 150 are used. The minimal number of observations per tree leaf is 10.

set.seed(15)
fitControl <- trainControl(method = "repeatedcv", number = 5, 
    repeats = 3, verboseIter = FALSE, allowParallel = FALSE)
tunegrid <- expand.grid(interaction.depth = c(1, 2, 3), shrinkage = c(0.001, 
    0.005, 0.01), n.trees = c(50, 100, 150), n.minobsinnode = 10)
garbage <- suppressWarnings(capture.output(cond_gbm <- train(critical_temp ~ 
    ., data = cond_train, method = "gbm", trControl = fitControl, 
    tuneGrid = tunegrid)))
print(cond_gbm)
## Stochastic Gradient Boosting 
## 
## 3333 samples
##   81 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 2666, 2667, 2666, 2665, 2668, 2665, ... 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.trees  RMSE      Rsquared   MAE     
##   0.001      1                   50      32.62523  0.5533129  27.67605
##   0.001      1                  100      31.84922  0.5748501  26.97639
##   0.001      1                  150      31.11919  0.5890751  26.31968
##   0.001      2                   50      32.42004  0.6559535  27.47111
##   0.001      2                  100      31.44503  0.6576384  26.58487
##   0.001      2                  150      30.53580  0.6599337  25.75881
##   0.001      3                   50      32.36867  0.6954543  27.43889
##   0.001      3                  100      31.34418  0.6971619  26.52150
##   0.001      3                  150      30.38621  0.6991506  25.66557
##   0.005      1                   50      29.77755  0.6085155  25.10161
##   0.005      1                  100      27.04066  0.6333489  22.56833
##   0.005      1                  150      24.98056  0.6438804  20.59398
##   0.005      2                   50      28.88745  0.6642530  24.25788
##   0.005      2                  100      25.64711  0.6778480  21.29254
##   0.005      2                  150      23.31911  0.6920473  19.11261
##   0.005      3                   50      28.64215  0.7012131  24.10565
##   0.005      3                  100      25.17344  0.7146197  20.99371
##   0.005      3                  150      22.64269  0.7271795  18.69022
##   0.010      1                   50      27.02805  0.6333693  22.55584
##   0.010      1                  100      23.45461  0.6496371  19.05223
##   0.010      1                  150      21.47307  0.6621218  16.95981
##   0.010      2                   50      25.63618  0.6782711  21.28608
##   0.010      2                  100      21.62097  0.7035182  17.48365
##   0.010      2                  150      19.45361  0.7228087  15.37634
##   0.010      3                   50      25.15205  0.7128403  20.97529
##   0.010      3                  100      20.75616  0.7376875  16.91859
##   0.010      3                  150      18.43074  0.7532463  14.64805
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.01 and n.minobsinnode = 10.
plot(cond_gbm)

According to the results, final values used for the model is n.trees=150, interaction.depth=3, shrinkage=0.01 and n.minobsinnode = 10.

cond_gbm_pred <- data.frame(predict(cond_gbm, cond_test[, -82]))
colnames(cond_rf_pred) <- "s0"
rmse(cond_test$critical_temp, cond_rf_pred$s0)
## [1] 11.7012

Critical temperatures are predicted with Stochastic Gradient Boosting. Then, RMSE values are used for comparing the models. RMSE value of Stochastic Gradient Boosting is 11.70342.

Conclusion

To be conclude, RMSE values of the prediction of test set are:
- Penalized Regression Approaches (PRA): 17.47174
- Decision Trees (DT): 17.28844
- Random Forests (RF): 11.70342
- Stochastic Gradient Boosting (SGB): 11.70342

So, Random Forest or Stochastic Gradient Boosting can be selected for this dataset as the best predictive model compared to others, since they have the same and minimum RMSE value.

Summary

In the classification problems, random forest models are selected as the best model compared to other. Because it has the highest accuracy value in all cases.
In the regression problem, random forest and stochastic gradient boosting model have given the same RMSE values.
It can be said that random forest and stochastic gradient boosting models gives better prediction results than penalized regression approach and decision trees.